categorical column
Fed-TGAN: Federated Learning Framework for Synthesizing Tabular Data
Zhao, Zilong, Birke, Robert, Kunar, Aditya, Chen, Lydia Y.
Generative Adversarial Networks (GANs) are typically trained to synthesize data, from images and more recently tabular data, under the assumption of directly accessible training data. Recently, federated learning (FL) is an emerging paradigm that features decentralized learning on client's local data with a privacy-preserving capability. And, while learning GANs to synthesize images on FL systems has just been demonstrated, it is unknown if GANs for tabular data can be learned from decentralized data sources. Moreover, it remains unclear which distributed architecture suits them best. Different from image GANs, state-of-the-art tabular GANs require prior knowledge on the data distribution of each (discrete and continuous) column to agree on a common encoding -- risking privacy guarantees. In this paper, we propose Fed-TGAN, the first Federated learning framework for Tabular GANs. To effectively learn a complex tabular GAN on non-identical participants, Fed-TGAN designs two novel features: (i) a privacy-preserving multi-source feature encoding for model initialization; and (ii) table similarity aware weighting strategies to aggregate local models for countering data skew. We extensively evaluate the proposed Fed-TGAN against variants of decentralized learning architectures on four widely used datasets. Results show that Fed-TGAN accelerates training time per epoch up to 200% compared to the alternative architectures, for both IID and Non-IID data. Overall, Fed-TGAN not only stabilizes the training loss, but also achieves better similarity between generated and original data. Our code is released at https://github.com/zhao-zilong/Fed-TGAN.
- Europe > Netherlands > South Holland > Delft (0.05)
- Europe > Switzerland (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
TabGLM: Tabular Graph Language Model for Learning Transferable Representations Through Multi-Modal Consistency Minimization
Majee, Anay, Xenochristou, Maria, Chen, Wei-Peng
Handling heterogeneous data in tabular datasets poses a significant challenge for deep learning models. While attention-based architectures and self-supervised learning have achieved notable success, their application to tabular data remains less effective over linear and tree based models. Although several breakthroughs have been achieved by models which transform tables into uni-modal transformations like image, language and graph, these models often underperform in the presence of feature heterogeneity. To address this gap, we introduce T abGLM(T abular G raph Language M odel), a novel multi-modal architecture designed to model both structural and semantic information from a table. TabGLM transforms each row of a table into a fully connected graph and serialized text, which are then encoded using a graph neural network (GNN) and a text encoder, respectively. TabGLM's flexible graph-text pipeline efficiently processes heterogeneous datasets with significantly fewer parameters over existing Deep Learning approaches. Evaluations across 25 benchmark datasets demonstrate substantial performance gains, with TabGLM achieving an average AUC-ROC improvement of up to 5.56% over State-of-the-Art (SoT A) tabular learning methods. 1 Introduction Real-world applications ranging from predicting sales in e-commerce to diagnosing diseases in healthcare rely on tabular data. These datasets are oftentimes a mix of numerical, categorical, and text values, presenting a unique challenge for machine learning models. Traditional approaches (Breiman 2001; Chen and Guestrin 2016; Prokhorenkova et al. 2018) as well as some early Deep Learning (DL) models (Y oon et al. 2020; Arik and Pfister 2021; Gorishniy et al. 2021; Hollmann et al. 2023) convert textual data into numerical encodings modeling only structural features from an input table, leading to loss of semantic information.
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.46)
- Overview > Innovation (0.46)
Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class
D'souza, Annie, M, Swetha, Sarawagi, Sunita
Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, recently higher capacity deep generative models are providing greater promise. However, handling of imbalance in class distribution when building a deep generative model is also a challenging problem, that has not been studied as extensively as imbalanced classifier model training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. %In this paper, we start with the observation that imbalanced data training of generative models trained imbalanced dataset which under-represent the minority class. We propose a novel technique of converting the binary class labels to ternary class labels by introducing a class for the region where minority and majority distributions overlap. We show that just this pre-processing of the training set, significantly improves the quality of data generated spanning several state-of-the-art diffusion and GAN-based models. While training the classifier using synthetic data, we remove the overlap class from the training data and justify the reasons behind the enhanced accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models demonstrating that our method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance.
- Banking & Finance (1.00)
- Information Technology (0.67)
TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation
Shi, Juntong, Xu, Minkai, Hua, Harper, Zhang, Hengrui, Ermon, Stefano, Leskovec, Jure
Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. We further introduce a multi-modal stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Code is available at https://github.com/MinkaiXu/TabDiff. Tabular data is ubiquitous in various databases, and developing effective generative models for it is a fundamental problem in many data processing and analysis tasks, ranging from training data augmentation (Fonseca & Bacao, 2023), data privacy protection (Assefa et al., 2021; Hernandez et al., 2022), to missing value imputation (You et al., 2020; Zheng & Charoenphakdee, 2022). With versatile synthetic tabular data that share the same format and statistical properties as the existing dataset, we are able to completely replace real data in a workflow or supplement the data to enhance its utility, which makes it easier to share and use. The capability of anonymizing data and enlarging sample size without compromising the overall data quality enables it to revolutionize the field of data science.
- Asia > China > Beijing > Beijing (0.06)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
TabularFM: An Open Framework For Tabular Foundational Models
Tran, Quan M., Hoang, Suong N., Nguyen, Lam M., Phan, Dzung, Lam, Hoang Thanh
Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured data, such as text and images, or semi-structured data, like time-series. However, there has been limited attention to structured data, such as tabular data, which, despite its prevalence, remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs for various tabular data tasks. In response to this gap, we introduce a framework called TabularFM, which incorporates state-of-the-art methods for developing FMs specifically for tabular data. This includes variations of neural architectures such as GANs, VAEs, and Transformers. We have curated a million of tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York (0.04)
- (3 more...)
Structured Evaluation of Synthetic Tabular Data
Yang, Scott Cheng-Hsin, Eaves, Baxter, Schmidt, Michael, Swanson, Ken, Shafto, Patrick
Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of the many metrics. To address this issue, we propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data. Through various structural decomposition of the objective, this framework allows us to reason for the first time the completeness of any set of metrics, as well as unifies existing metrics, including those that stem from fidelity considerations, downstream application, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
GAN-based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions
Fallahian, Mohammadali, Dorodchi, Mohsen, Kreth, Kyle
In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases that are difficult to retrieve. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data, which can be useful where an approximate answer to the queries would be acceptable in a fraction of the real execution time. This study explores the novel utilization of Generative Adversarial Networks (GANs) in the generation of tabular data that can be employed in AQP for synopsis construction. We thoroughly investigate the unique challenges posed by the synopsis construction process, including maintaining data distribution characteristics, handling bounded continuous and categorical data, and preserving semantic relationships and then introduce the advancement of tabular GAN architectures that overcome these challenges. Furthermore, we propose and validate a suite of statistical metrics tailored for assessing the reliability of the GAN-generated synopses. Our findings demonstrate that advanced GAN variations exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > North Carolina > Mecklenburg County > Charlotte (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
Zhang, Hengrui, Zhang, Jiani, Srinivasan, Balasubramaniam, Shen, Zhengyuan, Qin, Xiao, Faloutsos, Christos, Rangwala, Huzefa, Karypis, George
Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging due to the intricately varied distributions and a blend of data types of tabular data. This paper introduces TABSYN, a methodology that synthesizes tabular data by leveraging a diffusion model within a variational autoencoder (VAE) crafted latent space. The key advantages of the proposed TABSYN include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data, (3) Speed: much fewer number of reverse steps and faster synthesis speed than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that TABSYN outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations compared with the most competitive baselines.
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > Taiwan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
MargCTGAN: A "Marginally'' Better CTGAN for the Low Sample Regime
Afonja, Tejumade, Chen, Dingfan, Fritz, Mario
The potential of realistic and useful synthetic data is significant. However, current evaluation methods for synthetic tabular data generation predominantly focus on downstream task usefulness, often neglecting the importance of statistical properties. This oversight becomes particularly prominent in low sample scenarios, accompanied by a swift deterioration of these statistical measures. In this paper, we address this issue by conducting an evaluation of three state-of-the-art synthetic tabular data generators based on their marginal distribution, column-pair correlation, joint distribution and downstream task utility performance across high to low sample regimes. The popular CTGAN model shows strong utility, but underperforms in low sample settings in terms of utility. To overcome this limitation, we propose MargCTGAN that adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
- North America > United States > Texas (0.06)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Europe > Germany > Saarland > Saarbrücken (0.04)
- Government > Regional Government (0.68)
- Health & Medicine (0.68)
- Information Technology > Security & Privacy (0.46)
Synthetic Demographic Data Generation for Card Fraud Detection Using GANs
Wang, Shuo, Tricco, Terrence, Jiang, Xianta, Robertson, Charles, Hawkin, John
Using machine learning models to generate synthetic data has become common in many fields. Technology to generate synthetic transactions that can be used to detect fraud is also growing fast. Generally, this synthetic data contains only information about the transaction, such as the time, place, and amount of money. It does not usually contain the individual user's characteristics (age and gender are occasionally included). Using relatively complex synthetic demographic data may improve the complexity of transaction data features, thus improving the fraud detection performance. Benefiting from developments of machine learning, some deep learning models have potential to perform better than other well-established synthetic data generation methods, such as microsimulation. In this study, we built a deep-learning Generative Adversarial Network (GAN), called DGGAN, which will be used for demographic data generation. Our model generates samples during model training, which we found important to overcame class imbalance issues. This study can help improve the cognition of synthetic data and further explore the application of synthetic data generation in card fraud detection.
- North America > Canada > Newfoundland and Labrador > Newfoundland > St. John's (0.06)
- Europe > Middle East > Cyprus > Larnaka > Larnaca (0.04)
- Law Enforcement & Public Safety > Fraud (1.00)
- Leisure & Entertainment > Sports > Olympic Games (0.47)